It's funny how polished the imagery on Emmanuel Macron's site is. Can we turn it into an image gallery?
We start from this URL: https://en-marche.fr/emmanuel-macron/le-programme
In [2]:
from bs4 import BeautifulSoup
import requests
In [3]:
r = requests.get('https://en-marche.fr/emmanuel-macron/le-programme')
In [4]:
soup = BeautifulSoup(r.text, 'html.parser')
In [5]:
proposals = soup.find_all(class_='programme__proposal')
In [6]:
proposals = [p for p in proposals if 'programme__proposal--category' not in p.attrs['class']]
In [7]:
len(proposals)
Out[7]:
In [8]:
p = proposals[0]
In [9]:
full_url = 'https://en-marche.fr' + p.find('a').attrs['href']
full_url
Out[9]:
In [10]:
full_urls = ['https://en-marche.fr' + p.find('a').attrs['href'] for p in proposals]
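The absolute URLs above are built by string concatenation, which presumably works because every `href` on this page is root-relative. As a minimal sketch of a more tolerant alternative, `urllib.parse.urljoin` from the standard library also copes with hrefs that are already absolute. The sample hrefs below are made up for illustration and are not taken from the site.

```python
from urllib.parse import urljoin

BASE_URL = 'https://en-marche.fr'

# Hypothetical hrefs, for illustration only.
sample_hrefs = [
    '/emmanuel-macron/le-programme/example',  # root-relative
    'https://en-marche.fr/another/example',   # already absolute
]

# urljoin resolves each href against the base, leaving absolute URLs untouched.
sample_urls = [urljoin(BASE_URL, href) for href in sample_hrefs]
print(sample_urls)
```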
In [11]:
full_urls[:10]
Out[11]:
In [12]:
r = requests.get(full_url)
soup = BeautifulSoup(r.text, 'html.parser')
In [13]:
figure_tag = soup.find('figure', class_='fullscreen')
figure_tag
Out[13]:
We can now extract the link to the image.
In [14]:
src_url = 'https://en-marche.fr' + figure_tag('img')[0].attrs['src']
src_url
Out[14]:
We can display it in the notebook.
In [15]:
from IPython.display import Image
In [16]:
Image(url=src_url)
Out[16]:
In [17]:
def extract_img_src(url):
    "Extracts image src url from linked page."
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    figure_tag = soup.find('figure', class_='fullscreen')
    # figure_tag('img') is find_all: it returns a (possibly empty) list, never None
    imgs = figure_tag('img') if figure_tag is not None else []
    if imgs:
        return 'https://en-marche.fr' + imgs[0].attrs['src']
    print("no image for url: {}".format(url))
    return None
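One way to make the extraction step easier to check is to separate parsing from downloading. The sketch below is a variant under stated assumptions, not the notebook's own code: `parse_img_src` is a hypothetical helper that takes the HTML text directly, and the sample markup is invented to mimic the `figure.fullscreen` structure used above.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def parse_img_src(html, base_url='https://en-marche.fr'):
    """Return the absolute src of the first <img> inside <figure class="fullscreen">, or None."""
    soup = BeautifulSoup(html, 'html.parser')
    figure_tag = soup.find('figure', class_='fullscreen')
    if figure_tag is None:
        return None
    img_tag = figure_tag.find('img')
    if img_tag is None or not img_tag.get('src'):
        return None
    return urljoin(base_url, img_tag['src'])

# Invented markup, for illustration only.
sample_html = '<figure class="fullscreen"><img src="/images/example.jpg"></figure>'
print(parse_img_src(sample_html))
print(parse_img_src('<p>no figure here</p>'))
```

Because the helper never touches the network, it can be exercised on saved pages or small snippets before running it over every proposal URL.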
We can repeat this process and build a gallery from all of these images.
In [18]:
srcs = [extract_img_src(url) for url in full_urls]
In [19]:
srcs = [src for src in srcs if src is not None]
In [20]:
header = """<!doctype html>
<html lang="fr">
<head>
<meta charset="utf-8">
<title>Galerie des photos du site d'Emmanuel Macron</title>
<style>
img {width: 100%;}
</style>
</head>"""
In [22]:
def format_as_img_tag(src):
    # quote the attribute value so the tag stays valid for any URL
    return '<img src="{}" />'.format(src)
In [23]:
format_as_img_tag(srcs[2])
Out[23]:
In [24]:
# write as UTF-8 to match the charset declared in the header
with open('galerie_macron.html', 'w', encoding='utf-8') as f:
    body = """<body>
{0}
</body>""".format("\n".join(format_as_img_tag(url) for url in srcs))
    html = header + body + "</html>"
    f.write(html)
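The gallery page is just a header, one `<img>` tag per URL, and a closing tag. As a hedged sketch of the same assembly step, the hypothetical helper `build_gallery_html` below quotes and escapes each `src` with the standard library's `html.escape`, so a URL containing `&` or quotes would still yield valid markup. The sample URLs are placeholders.

```python
from html import escape

def build_gallery_html(srcs, title='Gallery'):
    """Assemble a minimal HTML page showing each image at full width."""
    img_tags = "\n".join(
        '<img src="{}" />'.format(escape(src, quote=True)) for src in srcs
    )
    # Doubled braces in the CSS rule are literal braces for str.format.
    return (
        '<!doctype html>\n'
        '<html>\n'
        '<head>\n<meta charset="utf-8">\n'
        '<title>{}</title>\n'
        '<style>img {{width: 100%;}}</style>\n'
        '</head>\n'
        '<body>\n{}\n</body>\n'
        '</html>'
    ).format(escape(title), img_tags)

# Placeholder URLs, for illustration only.
sample_srcs = ['https://example.org/a.jpg', 'https://example.org/b.jpg?x=1&y=2']
page = build_gallery_html(sample_srcs)
print(page)
```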
These are beautiful photos...
Now that François Fillon's program has been released, we can repeat the approach.
In [35]:
r = requests.get('https://www.fillon2017.fr/projet/')
soup = BeautifulSoup(r.text, 'html.parser')
In [36]:
tags = soup.find_all('a', class_='projectItem__inner')
In [37]:
sublinks = [tag.attrs['href'] for tag in tags]
Now we tackle the individual pages.
In [39]:
sublinks[0]
Out[39]:
In [38]:
r = requests.get(sublinks[0])
soup = BeautifulSoup(r.text, 'html.parser')
In [48]:
src = soup.find('div', class_='singleProject__banner bannerWithMask backgroundCover').attrs['style'].split("background-image: url(")[1][1:-3]
In [49]:
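def extract_img_src(url):
    "Extracts the banner image url from the inline style of a project page."
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    # the image url sits inside the banner's style: background-image: url('...')
    src = soup.find('div', class_='singleProject__banner bannerWithMask backgroundCover').attrs['style'].split("background-image: url(")[1][1:-3]
    return src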
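The slice `[1:-3]` relies on the exact characters that surround the URL in the `style` attribute. As a sketch of a less position-dependent approach, a regular expression can pull out whatever sits inside `url(...)`, with or without quotes. The helper name `css_background_url` and the sample style strings are hypothetical.

```python
import re

# Matches url(...) with optional single or double quotes around the address.
_URL_RE = re.compile(r"background-image:\s*url\(\s*['\"]?(.*?)['\"]?\s*\)")

def css_background_url(style):
    """Return the URL inside a CSS background-image declaration, or None."""
    match = _URL_RE.search(style)
    return match.group(1) if match else None

# Invented style strings, for illustration only.
quoted = "background-image: url('https://example.org/banner.jpg');"
unquoted = "background-image: url(https://example.org/b.jpg)"
print(css_background_url(quoted))
print(css_background_url(unquoted))
```

This pattern would still fail on a URL that itself contains a closing parenthesis, so it is a convenience rather than a full CSS parser.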
In [51]:
srcs = [extract_img_src(url) for url in sublinks]
In [52]:
srcs
Out[52]:
In [53]:
# write as UTF-8 to match the charset declared in the header
with open('galerie_fillon.html', 'w', encoding='utf-8') as f:
    body = """<body>
{0}
</body>""".format("\n".join(format_as_img_tag(url) for url in srcs))
    html = header + body + "</html>"
    f.write(html)